method to evaluate complex classroom observations that captures the salient
features of reform-based teaching is described. (KHR)

Observing Teaching: A reform-based framework for looking into classrooms

Introduction 

Teaching and its measure are critical components of efforts to promote student learning 
(National Commission on Teaching and America's Future, 1996). Reformers exploring ways to
support exemplary teaching, and thus advance student learning, depend on suitable methods for
gathering information on what is effective. In our work we are exploring the use of reform-based
curriculum materials to promote exemplary teaching in science. We have designed explicit 
support for teachers to learn about teaching within our materials (Schneider & Krajcik, 2002). 
This work has led us to examine teachers’ classroom practices in response to the support for 
teacher thinking in the materials. Observation of classroom teaching is essential to improve our 
understanding of how to help teachers learn and enact reform-based practices (Anderson, 2001). 
However, the rich descriptions provided by qualitative methods are time and labor intensive,
restricting observation to only a few teachers. Data from a variety of classrooms is needed
to develop truly effective programs. We are also interested in the scalability of our teacher
educative materials and therefore need a measure of teaching that is less cumbersome than detailed
descriptions of classroom events. Although many quantitative measures are feasible on a large
scale, they fail to capture the true complexity of what happens in classrooms. In this paper we
describe the development of a method to evaluate complex classroom observations that captures 
the salient features of reform-based teaching and is feasible on a larger scale. 

Observing teaching 

Based on goals for student learning in science, reformers are exploring new ways to help 
teachers learn how to use inquiry with collaboration supported by use of technology tools to 
support students in actively constructing deep understanding of important science concepts 
(National Research Council, 1996). One new idea is to include explicit support for teachers to
learn about teaching within curriculum materials, making them educative for teachers (Ball &
Cohen, 1996). This strategy has the potential to facilitate instructional improvement on a large 
scale but lacks specific design ideas and empirical evidence that it can be effective. Research in 
classrooms is needed to guide the design and improvement of educative materials. 

We have developed science materials to reflect desired reforms and provide teachers with 
needed support to learn and enact innovative curriculum as part of an ongoing systemic initiative 
of a large urban public school district. Developers created materials based on the premises of 
project-based science and were guided by design principles that include: contextualization, 
alignment with standards, sustained student inquiry, embedded learning technologies, 
collaboration and discourse, assessment techniques, and scaffolds and supports for teachers 
(Krajcik, Czerniak, & Berger, 2002; Schneider & Krajcik, 2002; Singer, Marx, Krajcik, &
Clay-Chambers, 2000). Materials were designed to be educative by including detailed lesson
descriptions that addressed necessary content, pedagogy, and pedagogical content knowledge for 
teachers. We were interested in a scalable method of analyzing classroom enactment data to gain 
meaningful information on which to base revisions of curriculum materials and improve support 
for teachers in learning and enacting new instructional practices. 

Researchers interested in understanding how to support improved teaching consider 
classroom observation essential to determining the success of their efforts to change teachers' 
practice (Blumenfeld, Krajcik, Marx, & Soloway, 1994; Palincsar, Magnusson, Marano, Ford, & 
Brown, 1998; Wood, Cobb, & Yackel, 1991). This requires careful observation and analysis of 
classroom events including teachers' behaviors and statements. Typically this process extends to 
only one to four classrooms over a period of one to four years. Therefore, we have only 
speculative knowledge as to why many wonderful curriculum ideas fail to be realized in
classrooms beyond the initial implementations (Brown, 1992). 

Researchers attempting to identify specific factors that influence student achievement are 
examining national and state level data. From this work we have some evidence that the quality 
of teaching is related to student outcomes. For instance, Darling-Hammond (1999) examined 
state level data on teacher preparation, certification, and experience along with changes in 
student achievement over several years. She describes teacher professional development as the 
most important means to improving student achievement scores. However, this approach does 
not identify what these teachers are doing in the classroom to impact student learning. Likewise, 
work by Sanders and Horn (1998) indicates a long-term effect of individual teachers on student
achievement scores. But again this work does not describe what this quality teaching looks like 
in a classroom. Therefore, these studies cannot point to the features of teacher preparation, 
knowledge, or experience that are particularly worthwhile. This leaves the topic of how to 
improve teaching and student outcomes open to debate (Cochran-Smith & Fries, 2001). 

One approach that merits further development is classroom observation research that 
links specific curriculum to teachers' instruction (Collopy, 1999; Prawat, 1992; Remillard, 1999). 
In these studies, analysis of observations is guided by frameworks based on recommended 
curriculum. This approach is facilitated when researchers describe their curriculum in terms of 
reform guidelines. The reform-based curriculum that is the focus of this study is one such 
example (see Schneider & Krajcik, 2002; Singer et al., 2000). 

A better understanding of how teachers and students interact around specific materials
and ideas in classrooms is needed (Ball, 2000). Teaching is a thinking practice; it is not enough 
to measure teachers’ knowledge or behaviors independently (Lampert, 1998). We need to look 
into classrooms to observe teachers’ practices in light of support for teacher thinking. Similarly,
measures of student achievement matched to specific curriculum are more likely to capture the
impact on student achievement than general measures (Ruiz-Primo, Shavelson, Hamilton, &
Klein, 2002). General measures of student achievement do not necessarily indicate what students 
have learned in the classroom. In this study we used a reform-based science curriculum unit to 
develop a framework for observing teaching and linked our observation results to student
outcomes that were also measured in close alignment with the curriculum project.

Research design 

The goal of this research was to design a systematic research method for observing 
classroom teaching that was consistent with reform recommendations and adaptable to use on a 
large scale. To support teachers in science reform, project-based materials were developed to 
address important science ideas, offer multiple learning opportunities, and provide appropriate 
instructional supports for students. The materials also incorporate ideas about how and what 
teachers need to learn to enact innovative curriculum. Materials include detailed lesson 
descriptions to assist teachers in enactment. Features to address the learning needs of teachers 
offer information to explain content and pedagogy, as well as specific information about 
strategies, representations, and students' ideas (PCK) embedded within lessons (Schneider & 
Krajcik, 2002). To determine whether the educative materials were indeed helpful for teachers
enacting project-based science, a careful qualitative study was conducted. The results of that
work have been reported in two other papers (Schneider, Blumenfeld, & Krajcik, 2002;
Schneider, Krajcik, & Blumenfeld, 2002). Teachers' enactments of these lesson sequences were
examined and characterized across lessons and teachers in light of the instructional practices
recommended in the materials. One outcome of this work is the beginning of a scalable scoring
rubric to measure quality of teaching in comparison to curriculum goals. 

Methods 

Background 

This study was conducted in four urban middle schools located in low SES 
neighborhoods selected to participate in initial stages of the reform effort (Krajcik, Marx, 
Blumenfeld, Soloway, & Fishman, 2000). Students in these schools were predominantly African 
American (95% to 100%) with high percentages of students receiving free or reduced lunch 
(29% to 66%). Scores on local and statewide achievement testing in science were reported as 
below grade level in three of the four schools. 

Curriculum material development was considered an essential component of the change 
effort, particularly to facilitate change within classrooms on a large scale (Blumenfeld, Fishman, 
Krajcik, Marx, & Soloway, 2000; Singer et al., 2000). The project-based science curriculum 
materials used by teachers in this study were developed as part of the larger reform effort. As a 
researcher and curriculum developer, the first author took a lead role in designing these materials 
to support both students and teachers in the transition to inquiry-based science instruction (see
Schneider & Krajcik, 2002; Schneider, Krajcik et al., 2002). However, the educative features of 
the materials were only one part of the professional development involved in this reform effort 
(Fishman & Best, 2000). 

Four eighth-grade teachers participating in the reform effort used materials for a ten-week 
unit on force and motion. Teaching experience ranged from 6 to 20 years. Prior to enacting this 
unit, each of the four teachers had limited experience with one or more of the following aspects: 
project-based science, physics, or the use of technological tools to support inquiry. Although 
they were not selected as a statistically random sample, their disparate backgrounds made this
group representative of middle school science teachers across the district. 

Data Collection and Preparation 

Target lesson sequences. Five target lesson sequences containing experiences with
phenomena, investigation, technology use, or artifact development, each spanning 3-5 days, were
selected for analysis. These lesson sequences were selected because each represented different
aspects of inquiry teaching that were to be used to focus descriptions of classroom enactments. 
These aspects included how teachers a) presented science ideas, b) promoted students' use of 
inquiry, c) used technology to promote student inquiry and concept development, d) used 
collaboration to promote student inquiry and concept development, and e) supported and 
assessed concept development through student artifacts. 

Materials descriptions. Summary descriptions of the materials were created to guide 
analysis of classroom enactments. Text relevant to each of the five target lesson sequences was 
selected. Text was coded for categories relevant to supporting student learning. These included: 
science ideas, contextualization, representation, strategy, suggested instructional supports, 
collaboration, and artifact development. The coded text was summarized to describe the intended 
enactments in terms of the goals for student learning, the opportunities for student learning, and 
suggested instructional supports for each minor and major learning opportunity identified from 
the text. 

Enactment descriptions. Detailed descriptions of classroom events were written from the 
videotape for each target lesson sequence and teacher. Teacher and student behavior and 
conversation were described in light of the lesson sequence descriptions in the materials. As 
these descriptions were prepared, we looked for and described: 1) science ideas (content and 
process ideas presented), 2) contextualization (referring to the driving question or anchor ideas,
using real life examples, stating value), 3) linking ideas to previous or future lessons or to other 
ideas, 4) directions given, 5) emphasis given (such as what ideas or tasks are important), 6)
specific strategies such as POE, 7) specific representations such as motion graphs, 8) scaffolding 
(modeling, coaching, feedback, or asking for justifications or reasons), and 9) group work 
(teacher statements on group work, teacher role during group work). We also noted suggested 
lesson sequences or portions of lesson sequences that were enacted, omitted, or adapted. Finally, 
descriptions of instruction were aligned with the intended opportunities for student learning as 
identified in the description of the materials and labeled accordingly. 

Data Analysis 

The coding scheme was designed to capture three aspects of enactment (presentation
of science ideas, opportunities for student learning, and support to enhance the
learning opportunities), each in comparison to what was intended in the materials. Coding
schemes used in this analysis were developed through an iterative process of creating codes,
coding, modifying and refining codes, and recoding, consistent with Miles and Huberman’s
(1994) recommendations for rigorous and meaningful qualitative data analysis. Reliability of the
coding process was assessed through the independent coding of several enactment episodes by
another science education researcher. Reliability was 88%. After the categories and rating levels were finalized
and reliability established, all enactment data were recoded with the final codes. 
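
The text reports reliability as a single percentage without naming the statistic. As a minimal sketch, assuming simple percent agreement between two coders (an assumption, not something the text confirms), the calculation might look like the following. The function name and the sample ratings are hypothetical and are not data from the study.

```python
from typing import Sequence


def percent_agreement(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Percentage of items on which two raters assigned the same code."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must code the same items")
    if not rater_a:
        raise ValueError("at least one coded item is required")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical ratings for eight episode/category pairs (not data from the study).
first_coder = ["High", "Medium", "Low", "High", "None", "Medium", "High", "Low"]
second_coder = ["High", "Medium", "Low", "Medium", "None", "Medium", "High", "Low"]
agreement = percent_agreement(first_coder, second_coder)
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common alternative when rating categories are few.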

The final coding scheme assessed instructional events in the following groups of 
categories: 1) accuracy and completeness of the science ideas presented, 2) the amount of student
learning opportunities, similarity of learning opportunities with those intended, and quality of the
adaptations, and 3) the amount of instructional supports offered, the appropriateness of the
instructional supports, and the source of ideas for instructional supports. Each enactment episode
was rated in each category according to the descriptions listed in Table 1 for each rating level. 
Entire episodes and the type of activity were considered to assign a rating for each category. A 
short statement of evidence or justification was written for each assigned rating. 

Assigning ratings. The categories of accuracy and completeness were included to capture 
information about the science ideas presented by teachers. Both content and process ideas were 
considered as well as whether the ideas presented were defined as a main or minor idea. The 
main ideas were defined as those identified in the purpose, objectives, or assessments of the 
materials for that lesson sequence. Likewise, minor ideas were defined as ideas secondary to,
related to, or supporting the main ideas. Teachers presented ideas in a variety of ways. These
included teachers' statements, examples, demonstrations, hints, or other types of guidance
regarding science ideas. A teacher's response or lack of response to students' actions or 
statements was also judged as giving students information about science ideas. In this case, a 
teacher may not have directly stated ideas accurately or inaccurately but, by the type of response 
they gave, implied that inaccurate student statements were acceptable or vice versa. Each type of
presentation of all ideas was considered when rating both accuracy and completeness. 

The rating of accuracy was unrelated to the rating of completeness. A rating of scientific 
for accuracy but incomplete or insufficient for completeness was possible and occurred. Also, 
unlike any other category, completeness included one rating that could apply in addition to the 
other ratings. This was the rating of excessive. This rating was used to indicate content that was
related to, but went beyond, what was intended for students in this unit. A teacher could be incomplete in covering the
intended content, yet also excessive by adding other related content. For example, in a lesson on 
velocity a teacher might not address the intended ideas of speed and direction as components of 
motion but might include the formula to calculate speed, which was not intended. This would be
rated as incomplete and excessive. 
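
Because excessive can be assigned in addition to one of the four ordered completeness levels, a record of this rating needs two fields rather than one. The following is a minimal sketch of such a record; the class name, field names, and `label` method are hypothetical illustrations and are not part of the original coding materials.

```python
from dataclasses import dataclass

# The four mutually exclusive completeness levels listed in Table 1.
COMPLETENESS_LEVELS = ("Thorough", "Sufficient", "Incomplete", "Insufficient")


@dataclass(frozen=True)
class CompletenessRating:
    """One completeness rating plus the independent 'excessive' flag."""

    level: str
    excessive: bool = False  # may accompany any of the four levels

    def __post_init__(self) -> None:
        if self.level not in COMPLETENESS_LEVELS:
            raise ValueError(f"unknown completeness level: {self.level!r}")

    def label(self) -> str:
        """Human-readable label, e.g. 'Incomplete and Excessive'."""
        return f"{self.level} and Excessive" if self.excessive else self.level


# The velocity example from the text: intended ideas are missing,
# but an unintended formula is added.
velocity_lesson = CompletenessRating("Incomplete", excessive=True)
```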

The categories of opportunities, similarity, and adaptation each refer to the learning 
opportunities for students observed in the episode. Opportunities for student learning included 
both teacher-led and small-group activities. Take-home activities that were incorporated into
class activities were included as opportunities, but work completed entirely at home was not. The
number of activities and the amount of time devoted to these activities were considered in light of
how the enactment episode was segmented. Episodes were not given lower ratings because the 
enactment was divided into several short segments. Opportunities were rated high if the number
of activities and the time spent were high in relation to the amount of class time represented in the episode.

Similarity was rated by considering both whether the opportunities observed were intended by the
materials and whether they were in a similar sequence with approximately the same emphasis.
For example, if a teacher directed students to make a prediction, but did not allow time for 
writing the predictions or for sharing some of the predictions in class before the observation 
phase, similarity would be rated low. 

Adaptations were opportunities provided that were not described in the materials. These 
activities were judged on whether or not they addressed content specified for the learning 
sequence and if the activity was likely to help students learn the content. Replacing a discussion 
of observed phenomena with a drill and practice to define terms would be rated as low. The 
terms may be the ones intended for use but understanding of relationships or application of ideas 
was the intended learning goal rather than the memorization of definitions. On the other hand, 
making an investigation more open by allowing students more choices in what to test would be 
rated high if students appeared to be ready to design an investigation with reduced structure. 

The categories of instructional supports, appropriateness, and sources each refer to the
instructional support for student thinking observed in the episode. Instructional supports included
a wide variety of teacher actions and statements that had the potential to enhance the learning
opportunities. These included supports for student thinking as well as supports for organizing 
and carrying out tasks. Examples included, but were not necessarily limited to: modeling thought 
processes or actions, coaching, giving hints, using examples, monitoring small-group work, 
giving reminders, asking for reasons or justification, structuring student work, offering guidance, 
and giving feedback. 

Instructional supports were rated high if the number of supports was high. Whether or not 
the supports appeared to be of a type that would help students learn the intended science content 
was judged in the category of appropriateness. Therefore, an episode could be rated high for 
supports if a teacher gave students many hints, but poor for appropriateness if those hints were 
likely to lead students in the wrong direction or did not match the type of difficulty students were 
exhibiting. The category of source was rated as matched when teachers used only supports that 
were suggested in the material for that lesson sequence. If the support was suggested in another 
lesson sequence, source was rated replaced or supplemented. If teachers used only supports not 
suggested in the materials, source was rated replaced. Supplemented was a rating used when 
supports of both types were observed. 
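
The source rule above can be read as a three-way decision over the supports observed in an episode. The sketch below is one reading of that rule, not the authors' procedure: it assumes each observed support has already been judged as either suggested in the materials for that lesson sequence or not (supports suggested only in another lesson sequence count as not suggested here). The function name and the handling of an episode with no supports are assumptions.

```python
from typing import Iterable, Optional


def rate_source(suggested_for_sequence: Iterable[bool]) -> Optional[str]:
    """Rate the source category for one episode.

    Each flag is True when an observed support was suggested in the
    materials for this lesson sequence, and False otherwise.
    """
    flags = list(suggested_for_sequence)
    if not flags:
        return None  # assumption: no supports observed, so no source rating
    if all(flags):
        return "matched"       # only supports suggested for this sequence
    if not any(flags):
        return "replaced"      # only supports not suggested for this sequence
    return "supplemented"      # supports of both types were observed
```

For example, an episode with two suggested supports and one teacher-devised support would be rated supplemented under this reading.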

Summarizing ratings. Ratings were then summarized across opportunities for each lesson 
sequence. The ratings and the justification statements in each category were compared 
sequentially for all enactment episodes already rated by opportunity. Then a judgment was made 
for a rating of the entire lesson sequence. A justification statement was also written for each 
lesson sequence rating based on a summary of the individual statements. To guide the 
summarization process, a set of guidelines was developed. When variation was evident,
summarizing was done in a way that appropriately reflected the variation in the final rating and 
justification statement. If the variation was minor, one rating was given but the variation was 
described in the justification statement. However, when variation was more pronounced, two or 
more ratings were assigned and the lesson sequence was labeled as varied. Again the justification 
statement described the variation. 
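
In the study, summarizing was a matter of researcher judgment guided by written guidelines that are not reproduced here. Purely as an illustration of the minor versus pronounced distinction, the sketch below substitutes a simple heuristic: ratings on a four-level ordinal scale are treated as showing minor variation when they span at most one adjacent level. The threshold, the scale ordering, the use of the most frequent rating, and all names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical ordinal scale, lowest to highest, as used by several categories.
LEVELS = ["None", "Low", "Medium", "High"]


def summarize(ratings: Sequence[str], levels: Sequence[str] = LEVELS,
              minor_spread: int = 1) -> Dict[str, object]:
    """Collapse episode ratings into a lesson-sequence summary."""
    if not ratings:
        raise ValueError("at least one episode rating is required")
    positions = sorted(levels.index(r) for r in ratings)
    spread = positions[-1] - positions[0]
    if spread <= minor_spread:
        # Minor or no variation: one rating, with variation noted in the justification.
        most_common = Counter(ratings).most_common(1)[0][0]
        return {"ratings": [most_common], "varied": False,
                "describe_variation": spread > 0}
    # Pronounced variation: keep every distinct rating and label the sequence as varied.
    observed = set(ratings)
    distinct: List[str] = [lvl for lvl in levels if lvl in observed]
    return {"ratings": distinct, "varied": True, "describe_variation": True}
```

A heuristic like this could only flag candidate summaries; the justification statement would still need to be written by a rater.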

The final analysis phase was to examine the coded lesson sequences for patterns across 
lesson sequences and teachers. Each category was traced across all lesson sequences for each 
teacher. During this examination, justifications for the ratings were also examined for patterns. 
Data also were examined in the same way for patterns across teachers. Summarizing across all 
teachers was not possible. However, summary ratings and justification statements were 
appropriate when teachers were placed into two enactment groups. 

Student Achievement Measures 

As part of the larger research effort in which this study was embedded, written 
assessment instruments were developed to assess student understanding of the curriculum 
content and science process skills (Krajcik et al., 2000). The assessments were administered to 
each student participating in the curriculum projects. The assessments consisted of a combination 
of multiple choice and free response items that were further classified as either curriculum 
content knowledge or science process skill items. Content and process items were categorized by 
one of three cognitive levels required for arriving at a complete answer: lower (recalling 
information; understanding simple and complex information); middle (drawing or understanding 
simple relationships; applying knowledge to new or different situations; shifting between 
representations such as verbal to graphic; identifying hypotheses, procedures, results, or 
conclusions); and higher (describing or analyzing data from charts and graphs; framing
hypotheses; drawing conclusions; defining or isolating variables given in a scenario; applying 
investigation skills; and using concepts to explain phenomena). The curriculum development 
teams (including science educators, content specialists, educational psychologists, and classroom 
teachers) constructed the tests. We analyzed all potential questions according to the scheme 
described above with teams of three to five raters achieving 95% accuracy in categorizing items. 
Disagreements were settled by consensus. The use of rubrics for each open-ended question 
produced over 95% agreement by two to four raters each. Again, disagreements were settled by 
consensus. 

Findings 

The coding categories and rating levels captured differences in enactment by teacher 
throughout all lesson sequences. Ratings also indicated teachers were fairly consistent in their 
enactments. This finding was backed up by the descriptions of specific observations of enactment
written in the justification statements. More importantly, this method of describing enactment 
made possible the identification of two groups of enactments. Two teachers' enactments tended 
to be a good match for the intended enactment whereas the other two teachers' enactments were 
less reflective of the intended enactment. Moreover, the distinction between the groups was 
evident not only in the ratings across analysis categories, but also in the specific aspects of 
enactments that led to the assigned ratings. In each case, the match of the individual teacher's
enactment to the respective group was quite reliable. 

These groups were also distinguished by students’ achievement scores. Effect sizes were 
statistically significant on high and medium cognitive level questions for students in the first 
group and were not statistically significant on high cognitive level questions for the second group 
(Table 2). Interestingly, only the category of accuracy was not a unique indicator for either group
or for student achievement. Teachers who presented science accurately were in both groups. 

This analysis identified eight main analysis categories: accuracy and completeness of
science ideas presented; amount of student learning opportunities, similarity of learning
opportunities with those intended, and quality of adaptations; and amount of instructional
supports offered, appropriateness of instructional supports, and source of ideas for instructional
supports. Rating levels for each category were described (Table 1). These rating levels were
effective in discriminating different levels of enactment. 

The careful examination of justification statements for patterns in each rating made 
possible the identification of two to six types of evidence for each main rating category. For 
example, the types of evidence for instructional supports that guided observations and ratings 
included: 1) types of instructional supports (questions, hints and reminders, and real life
examples and connections to a driving question), and 2) activities when instructional supports
were used (whole class set up and discussion, small-group work, and student presentations). By
rating each of the types of evidence, an overall rating for the category was possible and justified.

The identification of eight categories, rating levels, and types of evidence has made it 
possible to construct scoring rubrics for each category (see Figures 1 - 8). Further, the specific 
examples used to justify the assigned ratings during the analysis of enactment data described the 
characteristics of the evidence types that were consistent with high or low ratings for the 
category in general. For example, under instructional supports, when questions are used to guide
students to consider important content ideas, this evidence contributes to a high rating.
Conversely, when questions are used to elicit definitions, this evidence contributes to low ratings
for instructional strategies (Figure 6). These descriptions have been added to all rubrics to
guide evaluations. 

Discussion 

Measures used to research teaching in reform need to be reflective of the reform goals. 
The eight analysis categories are consistent with reform recommendations because they were 
developed from a reform-based curriculum framework. Moreover, these categories are not 
specific to the unit used to develop them. Rather the categories should be adaptable to any 
reform-oriented science program. Any quality program will be concerned with how content is
presented, whether students have opportunities to learn, and whether teachers give students guidance and
support. The categories were able to separate teachers' enactments into two groups that
correspond to the two groups indicated by student achievement scores. This suggests the categories
and rating levels are capturing something important about teachers’ practices that lead to student 
learning. 

The link from specific aspects of classroom teaching to student learning is an important 
one. Whereas others have shown that teachers affect student learning, they have not identified
specific instructional practices that lead to improved student outcomes (Darling-Hammond, 1999).
In addition, it is an important finding that measuring specific teacher behaviors is not sufficient 
to determine quality of teaching (Ball & Cohen, 1999; Borko, Cone, Russo, & Shavelson, 1979; 
Lampert, 1998). It is not enough to know whether teachers are asking questions. It was also
important to consider teachers’ goals in asking these questions. If we can learn what teachers can 
do to help students learn we also can learn what types of support teachers need to learn and enact 
these practices. 

The rubrics are based on enactment data and are formatted to facilitate scoring of
enactments directly. This should eliminate the need to collect, prepare, and analyze videotape or
detailed descriptions of classroom events. However, these rubrics have not been field-tested.
Although the categories and types of evidence have proven to be useful and informative, other 
types of evidence may emerge from further observations of reform-based enactments. In 
addition, we do not know if it is possible to score enactments in real time or if the rubrics will be 
more usable with videotape that can be paused and rewound. Although much simpler than 
careful qualitative analysis, these rubrics remain complex. It is likely improvements can be made 
in rubrics based on use in classrooms or enactment videotape. 

The process used to identify categories, rating levels, and specific types of evidence that 
could be used to characterize teaching was time and labor intensive. However, now that these 
have been identified, future evaluations will be much simpler. Further studies with more teachers
enacting reforms would increase the reliability of these recommendations. Through this work, an 
observation framework that is appropriate for larger scale studies could be created. These 
categories will be presented in a format easily adapted to various classrooms and curricula.
This will make the much needed large-scale studies of teacher enactments feasible. 

We developed these rubrics to evaluate the efficacy of teacher-educative, reform-based 
curriculum materials, but they can be adapted for use with other research questions. For example,
Davis (2002) is using student teachers’ unit plans to answer questions about how novices learn to 
teach. Others are using reform-based materials to promote student learning (Prawat et al.,
1992; Songer, Lee, & Kam, 2002). An evaluation scheme like the one presented here would be
helpful to gauge how closely enactment reflects the intended curriculum plan without looking for 
strict implementation (Apple & Jungck, 1990). 

The importance of this work lies in its ability to provide a tool to facilitate research on
teaching. One area of weakness is the lack of studies that bridge the gap between teacher 
preparation, classroom teaching, and student outcomes on a large scale. We know that teachers 
need to learn about teaching in the context of the classroom but we do not know how to 
efficiently support that learning (Putnam & Borko, 2000). Although we used this observation 
framework to inform the design of materials to support teachers in reform, this framework will 
be valuable in many areas of research on teaching. A method to evaluate teaching that is 
meaningful and usable on a large scale is needed to inform teacher education and professional 
development research. 

Table 1 Categories and rating levels of coding scheme used to analyze classroom enactment data 

Accuracy 
Scientific - all ideas are consistent with current scientific ideas 
Sufficient - consistent with current scientific ideas for all main ideas, inaccurate for minor ideas 
Semi accurate - inconsistent with current scientific ideas for some main ideas 
Non scientific - inconsistent with current scientific ideas for many main ideas 

Completeness 
Thorough - all the appropriate science ideas are addressed 
Sufficient - all the appropriate main ideas are addressed but some minor ideas are missing 
Incomplete - missing some main ideas 
Insufficient - missing several main ideas 
Excessive - includes ideas at a level beyond intended for students 

Opportunities 
Maximum - includes ample (number or time) opportunity for student learning 
Sufficient - includes some (number or time) opportunity for student learning 
Insufficient - includes few (number or time) opportunity for student learning 
Minimal - includes almost no (number or time) opportunity for student learning 

Similarity 
High - matched to intended lesson 
Medium - closely resembles intended lesson, minor changes 
Low - faintly resembles, major changes 
None - not consistent with intended lesson 

Adaptation 
High - adaptation consistent with learning goal and appropriate for students' learning needs 
Medium - adaptation consistent with learning goal but not appropriate for students' learning needs 
Low - adaptation not consistent with learning goal 
None - not adapted 

Instructional Supports 
High - provides many instructional supports for student thinking 
Medium - provides some instructional supports for student thinking 
Low - provides few instructional supports for student thinking 
None - provides no instructional supports for student thinking 

Appropriateness 
Excellent - instructional supports always used in ways matched to student learning needs 
Sufficient - instructional supports usually used in ways matched to student learning needs 
Insufficient - instructional supports usually not used in ways matched to student learning needs 
Poor - instructional supports always used in ways not matched to student learning needs 

Sources 
Supplemented - used instructional supports included in materials plus others 
Matched - used only instructional supports included in the materials 
Replaced - used only instructional supports not included in materials 
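
For readers who want to record ratings electronically, the coding scheme in Table 1 can be sketched as a simple lookup structure. The following is a minimal illustration only, not part of the original study: the names RUBRIC and validate_ratings are hypothetical, and the levels are listed from the left-most to the right-most column of the rating figures. The Excessive flag under Completeness is left out here because it is recorded separately from the four rating levels.

```python
# Hypothetical encoding of the Table 1 coding scheme (names are illustrative).
# Each category maps to its rating levels, ordered as in the rating figures.
RUBRIC = {
    "Accuracy": ["Non scientific", "Semi accurate", "Sufficient", "Scientific"],
    "Completeness": ["Insufficient", "Incomplete", "Sufficient", "Thorough"],
    "Opportunities": ["Minimal", "Insufficient", "Sufficient", "Maximum"],
    "Similarity": ["None", "Low", "Medium", "High"],
    "Adaptation": ["None", "Low", "Medium", "High"],
    "Instructional Supports": ["None", "Low", "Medium", "High"],
    "Appropriateness": ["Poor", "Insufficient", "Sufficient", "Excellent"],
    "Sources": ["Replaced", "Matched", "Supplemented"],
}


def validate_ratings(ratings):
    """Return a list of problems found in a {category: level} mapping."""
    problems = []
    # Every category in the scheme needs exactly one valid level.
    for category, levels in RUBRIC.items():
        if category not in ratings:
            problems.append(f"missing rating for {category}")
        elif ratings[category] not in levels:
            problems.append(f"{ratings[category]!r} is not a level of {category}")
    # Flag categories that are not part of the scheme.
    for category in ratings:
        if category not in RUBRIC:
            problems.append(f"unknown category {category}")
    return problems
```

A completed observation would then be a dictionary with one entry per category, and an empty list from validate_ratings would indicate that every category received a level defined in Table 1.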

Table 2 Student performance on pre- and post-tests for each teacher 

                                   Pre-test        Post-test 
                                   M (SD)          M (SD)          Effect Size (a) 

Enactment Group One 

Ms Franklin, Fall 1998 (N = 29) 
  High level (18 points)           1.66 (1.08)     3.97 (2.23)     2.14*** 
  Medium level (19 points)         6.34 (1.45)     10.03 (2.23)    2.54*** 
  Low level (16 points)            8.03 (2.28)     9.59 (3.21)     0.68* 
  Overall (53 points)              16.03 (3.45)    23.59 (6.16)    2 [illegible] 

Ms Wells, Fall 1999 (N = 56) 
  High level (4 points)            0.63 (1.59)     1.25 (1.96)     1.06*** 
  Medium level (9 points)          3.79 (1.39)     4.41 (1.69)     0.45* 
  Low level (8 points)             2.73 (1.27)     3.63 (1.36)     0.70*** 
  Overall (21 points)              7.14 (2.11)     9.29 (3.04)     1.01*** 

Enactment Group Two 

Mr. Davis, Fall 1999 (N = 25) 
  High level (4 points)            0.44 (0.65)     0.72 (1.02)     0.43 
  Medium level (9 points)          3.60 (1.32)     3.72 (1.51)     0.09 
  Low level (8 points)             2.40 (1.29)     4.40 (1.85)     1.55*** 
  Overall (21 points)              6.44 (1.87)     8.84 (3.16)     1.28*** 

Ms Turner, Fall 1998 (N = 25) 
  High level (18 points)           0.88 (1.05)     0.88 (1.01)     0.00 
  Medium level (19 points)         5.04 (1.90)     6.00 (2.40)     0.51* 
  Low level (16 points)            4.48 (2.22)     6.00 (2.87)     0.68* 
  Overall (53 points)              10.40 (3.31)    12.88 (5.10)    0.75** 

(a) Effect size was calculated as the difference between the means divided by the 
standard deviation of the pre-test. 

*p < .05. **p < .01. ***p < .001. 
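
The effect-size calculation described in the note to Table 2 can be illustrated with a short sketch. This is a minimal example under stated assumptions rather than the analysis code used in the study: the function and variable names are hypothetical, and the inputs are the rounded means and standard deviations printed in Table 2, so recomputed values may differ slightly from the printed ones. Applied to the overall scores for Ms Franklin, where the printed effect size is incomplete, the formula gives approximately 2.19. The significance level for that entry cannot be recovered this way.

```python
def effect_size(pre_mean, pre_sd, post_mean):
    """Gain in mean score divided by the pre-test standard deviation."""
    return (post_mean - pre_mean) / pre_sd


# (teacher, level): (pre-test M, pre-test SD, post-test M), copied from Table 2.
ROWS = {
    ("Ms Franklin", "High"): (1.66, 1.08, 3.97),
    ("Ms Franklin", "Overall"): (16.03, 3.45, 23.59),
    ("Mr. Davis", "Low"): (2.40, 1.29, 4.40),
    ("Mr. Davis", "Overall"): (6.44, 1.87, 8.84),
    ("Ms Turner", "Overall"): (10.40, 3.31, 12.88),
}


def effect_sizes(rows):
    """Compute the effect size for each row, rounded to two decimals."""
    return {key: round(effect_size(*vals), 2) for key, vals in rows.items()}


if __name__ == "__main__":
    for (teacher, level), es in effect_sizes(ROWS).items():
        print(f"{teacher:12s} {level:8s} {es:.2f}")
```

Dividing by the pre-test standard deviation, rather than a pooled value, expresses each gain relative to the spread of scores before instruction, which is the convention stated in the table note.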

Accuracy of science ideas presented 

Rating levels (lowest to highest): 
  Non scientific - Inconsistent with current scientific ideas for many main ideas 
  Semi accurate - Inconsistent with current scientific ideas for some main ideas 
  Sufficient - Consistent with current scientific ideas for all main ideas but inaccurate for minor ideas 
  Scientific - All ideas are consistent with current scientific ideas 

Types of evidence: 

Explicit statements 
  Definitions 
  Explanations 
  Examples 

Guidance 
  Direction of student attention to tasks 
    Lower ratings: Completion of tasks; Irrelevant factors 
    Higher ratings: Conceptually important aspects; Appropriate ideas 
  Guidance in connection with student predictions, hypothesis or conclusions 
    Lower ratings: Little guidance; Guided students to inappropriate form of statements 
    Higher ratings: Guided students to appropriate form of statements 
  Guidance in connection with student investigation design 
    Lower ratings: Little guidance; Guided students to incomplete or inappropriate design 
    Higher ratings: Guided students to complete and appropriate design 

Response to students 
  Accurate and inaccurate student statements 
    Lower ratings: Not distinguished 
    Higher ratings: Distinguished; inaccurate redirected, accurate acknowledged 
  Inaccurate student statements during presentations 
    Lower ratings: Generally not corrected 
    Higher ratings: Corrected 

Overall rating 

Figure 1: Types of evidence and rating levels for the category of Accuracy 

Completeness of science ideas presented 

Rating levels (lowest to highest): 
  Insufficient - Missing several main ideas 
  Incomplete - Missing some main ideas 
  Sufficient - All the appropriate main ideas are addressed but some minor ideas are missing 
  Thorough - All the appropriate science ideas are addressed 

Types of evidence: 

Intended content 
  Concepts intended for the lesson sequence 
    Lower ratings: Not addressed or only defined 
    Higher ratings: Addressed 
  Process ideas regarding investigations (variables and design) 
  Process ideas regarding graph reading and interpretation 
  Generalizable statements 
  Connections between ideas 
    Lower ratings: Not explicit or not made 
    Higher ratings: Explicitly addressed 

Excessive - Includes ideas at a level beyond intended for students (marked Yes or No) 

Outside content 
  Content beyond that intended 

Overall rating 

Figure 2: Types of evidence and rating levels for the category of Completeness 

Amount of Student Learning Opportunities 

Rating levels (lowest to highest): 
  Minimal - Includes almost no (number or time) opportunity for student learning 
  Insufficient - Includes few (number or time) opportunity for student learning 
  Sufficient - Includes some (number or time) opportunity for student learning 
  Maximum - Includes ample (number or time) opportunity for student learning 

Types of evidence: 

Time 
  Lower ratings: Class time short for all activities except final student presentations 
  Higher ratings: Adequate in class time for each type of activity 

Type of activity 
  Actions 
    Lower ratings: Incomplete 
    Higher ratings: Completed 
  Small-group work 
    Lower ratings: Limited; Includes little thoughtful work 
    Higher ratings: Frequent; Included action and thoughtful work 
  Discussion 
    Lower ratings: Limited; Few student ideas used 
    Higher ratings: Frequent; Used student ideas 

Structure 
  Activities 
    Lower ratings: Clustered by type 
    Higher ratings: Sequenced and cycled 
  Small-group work 
    Lower ratings: Monitored closely for completion 
    Higher ratings: Monitored but not overly structured; Students allowed to discuss and work together 
  Discussions 
    Lower ratings: Presented teacher ideas and explanations 
    Higher ratings: Used student ideas; Either clearly focused and directed or followed student ideas 
  Investigations 
    Lower ratings: Structured by list of items to complete; Not structured 
    Higher ratings: Structured by question 

Overall rating 

Figure 3: Types of evidence and rating levels for the category of Opportunities 

Similarity of Student Learning Opportunities 

Rating levels (lowest to highest): 
  None - Not consistent with intended lesson 
  Low - Faintly resembles, major changes 
  Medium - Closely resembles intended lesson, minor changes 
  High - Matched to intended lesson 

Types of evidence: 

Major learning opportunities 
  Overall opportunities 
  Phases of opportunities 

Sequence 
  Of overall opportunities 
  Of phases of opportunities 
    Descriptor: Combines like activities 

Emphasis 

Overall rating 

Figure 4: Types of evidence and rating levels for the category of Similarity 

Adaptation of Student Learning Opportunities 

Rating levels (lowest to highest): 
  None - Not adapted 
  Low - Adaptation not consistent with learning goal 
  Medium - Adaptation consistent with learning goal but not appropriate for students' learning needs 
  High - Adaptation consistent with learning goal and appropriate for students' learning needs 

Types of evidence: 

Additions 
  Lower ratings: Group presentations; Non-content supporting features 
  Higher ratings: Does not adapt; More whole class activities to address students' questions; 
    Investigation features such as variables; Final presentation features such as 
    questions or demonstrations of design 

Changes 
  Descriptors: Teacher-led activities changed to student activities; Small-group activities 
    changed to individual work 

Overall rating 

Figure 5: Types of evidence and rating levels for the category of Adaptation 

Amount of Instructional Supports 

Rating levels (lowest to highest): 
  None - Provides no instructional supports for student thinking 
  Low - Provides few instructional supports for student thinking 
  Medium - Provides some instructional supports for student thinking 
  High - Provides many instructional supports for student thinking 

Types of evidence: 

Types 
  Questions 
    Lower ratings: Used to elicit definitions or sometimes explanations 
    Higher ratings: Used to guide students to important content ideas 
  Hints and reminders 
    Lower ratings: Used as lists of items to complete 
    Higher ratings: Used to focus attention on content related aspects of activity and to guide doing a task 
  Real life examples and connections to driving question 
    Lower ratings: Rarely or occasionally used 
    Higher ratings: Frequent 

Activities 
  Whole class set-up and discussion 
    Lower ratings: Few supports; tasks may be student self-guided work 
  Small-group work 
    Lower ratings: Frequent prompts to complete 
    Higher ratings: Few interruptions 
  Presentations 
    Lower ratings: Few supports 
    Higher ratings: Guiding questions 

Overall rating 

Figure 6: Types of evidence and rating levels for the category of Instructional Supports 

Appropriateness of Instructional Supports 

Rating levels (lowest to highest): 
  Poor - Instructional supports always used in ways not matched to student learning needs 
  Insufficient - Instructional supports usually not used in ways matched to student learning needs 
  Sufficient - Instructional supports usually used in ways matched to student learning needs 
  Excellent - Instructional supports always used in ways matched to student learning needs 

Types of evidence: 

Questions and prompts 
  Lower ratings: Answered or explained by the teacher; Guide students to definitions or voting on right answers 
  Higher ratings: Guide students to focus on appropriate ideas 

Hints and reminders 
  Lower ratings: Address task completion 
  Higher ratings: Address ideas with which students may have trouble 

Students' ideas 
  Lower ratings: Not requested 
  Higher ratings: Requested; Connected to previously stated students' ideas 

Feedback 
  Lower ratings: Identifies mistakes or wrong answers 
  Higher ratings: Directs students to appropriate ideas 

Student questions and difficulties 
  Lower ratings: Not addressed 
  Higher ratings: Addressed 

Overall rating 

Figure 7: Types of evidence and rating levels for the category of Appropriateness 

Sources of Instructional Supports 

Rating levels (in the order shown): 
  Replaced - Used only instructional supports not included in materials 
  Matched - Used only instructional supports included in the materials 
  Supplemented - Used instructional supports included in materials plus others 

Types of evidence: 

From materials 
  Replaced or Matched side: Many suggested supports not used 
  Supplemented side: Uses questions to guide discussion; Uses driving question; 
    Comparisons to similar previous activities; Monitored groups 

Teacher added 
  Replaced or Matched side: None or Prompts for task completion and definitions 
  Supplemented side: Real-life examples; Supports from earlier parts of the materials 
    to later lesson sequences 

Trend 
  Replaced or Matched side: Matches throughout or matches early then quickly replaces 
  Supplemented side: Matches early, but quickly supplements 

Overall rating 

Figure 8: Types of evidence and rating levels for the category of Sources 